Data, DataViz, and Stats with the Stars

Agenda!

  • Orange? What is this Orange stuff, anyhow?
  • Throwing it All Away with Brad Pitt: Data Summaries
  • Counting Letters with Sherlock Holmes: Bar Charts
  • Nursery Rhymes with Ben Affleck: Line Charts
  • Being a Mermaid with Katie Ledecky: Box Plots
  • Jack and Rose lived happily ever after: Mosaic Plots
  • The Art of Surprise with Gabbar Singh: Permutation Tests

Orange? What is this Orange stuff, anyhow?

Installing and Getting Used to Orange

some stuff here

Brad Pitt: Throwing it All Away

Stephen Stigler (2016) in “The Seven Pillars of Statistical Wisdom”:

  • One of the Big Ideas in Statistics is: Aggregation
  • How is it revolutionary?
  • By stipulating that, given a number of observations, you can actually gain information by throwing information away
  • In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.
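A minimal sketch of what "throwing information away" means for the arithmetic mean. The two lists are made-up numbers, chosen only so that they share a mean; they are not from any dataset in this talk.

```python
from statistics import mean

# Two hypothetical sets of observations (illustrative numbers only).
tight_group = [4, 5, 6]
spread_group = [1, 5, 9]

# The mean reduces each set to one summary number...
tight_summary = mean(tight_group)
spread_summary = mean(spread_group)

# ...and in doing so discards the individual values: both sets now look alike.
print(tight_summary, spread_summary)
print("same summary, different data:", tight_group != spread_group)
```

The summary is easier to compare and reason about, which is the gain; the individuality of each measurement is the price.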

Brad Pitt: Throwing it All Away

What was he throwing away?

data table here

“OBP” as aggregate column explanation here
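A hedged sketch of "OBP" as an aggregate column, assuming it refers to baseball's on-base percentage in its commonly quoted form. The function name and the season numbers below are hypothetical, not taken from the data table.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Aggregate five counting columns into one number (assumed OBP formula)."""
    times_on_base = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    if chances == 0:
        raise ValueError("no plate appearances to aggregate")
    return times_on_base / chances


# Hypothetical season line, for illustration only.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
print(round(obp, 3))
```

Five columns go in and one comes out: the detail of how a player got on base is thrown away, and only how often remains.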

Counting Letters with Sherlock Holmes

Sherlock Holmes: The Adventure of the Dancing Men

In the Sherlock Holmes story The Adventure of the Dancing Men, a criminal known to one of the characters communicates with her using a child-like drawing, which looks like this:

Am Here, Abe Slaney

How would Holmes decipher this message?

Sherlock Holmes: The Adventure of the Dancing Men

  • Using Conjectures: Symbols -> Letters
    • Holmes deduces that the most common symbol in the message stands for “E”, the most common letter in English
    • He then takes “T” as roughly the next most common letter, though the ordering after “E” is much less clear-cut

Zipf’s Law

  • Based on well-known counts of letters in English text (Zipf’s Law)
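The counting step can be sketched in a few lines. This tallies the letters of the decoded message from the slide above; Holmes, of course, had to count symbols, not letters, so treat this as an illustration of the tally only.

```python
from collections import Counter

message = "AM HERE ABE SLANEY"  # decoded text shown on the earlier slide

# Count letters only, ignoring spaces.
letter_counts = Counter(ch for ch in message if ch.isalpha())

# Print a text bar chart, most frequent letter first.
for letter, count in letter_counts.most_common():
    print(f"{letter} {'#' * count} {count}")
```

Even in this very short message, "E" comes out on top, which is the conjecture Holmes starts from.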

What Charts work for counting?

Variable #1 | Variable #2 | Chart Names | Chart Shape
Qual        | None        | Bar Chart   |

Bar charts are used to show “counts” and “tallies” with respect to Qual variables. For instance, in a survey, how many people of each Gender? In a Target Audience survey on Weekly Consumption, how many low, medium, or high expenditure people?

Where’s our Data?

OK, Let’s get some data to count:

Penguins Data

And let’s for now use a pre-set Workflow in Orange

Barchart Workflow

Workflow#1: Bar Charts

  • We will look at the data
  • Make a Data dictionary
  • Identify the Qual and Quant variables
  • Prepare Counts and Bar Charts wrt Qual variables
  • In Orange! Point, Click, and See!

Data Dictionary

Qualitative Variables

  • species: Species of the penguin (Qual)
  • island: Island where the penguin was observed (Qual)
  • sex: Male / Female penguin (Qual)

Quantitative Variables

  • bill_length_mm: Length of the penguin’s bill in millimeters (Quant)
  • bill_depth_mm: Depth of the penguin’s bill in millimeters (Quant)
  • flipper_length_mm: Length of the penguin’s flipper in millimeters (Quant)
  • body_mass_g: Mass of the penguin in grams (Quant)

Counting our Data

Research Question

Are there more penguins of some species than there are in others?

Research Question

Are the penguin populations on different islands the same?

Wait, But Why?

  • Counts first give you an absolute sense of how much data you have.

  • Counts by different Qual variables give you a sense of the combinations you have in your data: (Male/Female) × (Species) × (Island), say 2 × 3 × 3 = 18 possible combinations in the data

  • Counts then give an idea whether your data is lop-sided

  • Since the X-axis in bar charts is Qualitative (the bars don’t touch, remember!) it is possible to sort the bars at will.
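For those who prefer code to pointing and clicking, the same ideas can be sketched in plain Python. The rows below are made up for illustration and are NOT the real penguins data; only the column meanings (species, island, sex) follow the data dictionary.

```python
from collections import Counter

# Hypothetical (species, island, sex) rows; not the real penguins data.
rows = [
    ("Adelie", "Torgersen", "male"),
    ("Adelie", "Torgersen", "female"),
    ("Adelie", "Biscoe", "female"),
    ("Adelie", "Dream", "male"),
    ("Gentoo", "Biscoe", "male"),
    ("Gentoo", "Biscoe", "female"),
    ("Gentoo", "Biscoe", "female"),
    ("Chinstrap", "Dream", "male"),
]

# Counts with respect to one Qual variable: species.
species_counts = Counter(species for species, _island, _sex in rows)

# Bars on a Qual axis can be sorted at will; here, tallest first.
sorted_bars = species_counts.most_common()
for name, n in sorted_bars:
    print(f"{name:<10} {'#' * n} {n}")

# How many of the possible combinations actually occur in the data?
combos_present = len(set(rows))
print("combinations present:", combos_present)
```

A lop-sided chart or a small number of combinations present is exactly what the counts are meant to reveal before any further analysis.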

Nursery Rhymes with Ben Affleck

Who was Solomon Grundy?

Being a Mermaid with Katie Ledecky

Jack and Rose lived happily ever after

Jack and Rose lived happily ever after?

  • What are the chances?
  • What did the chances depend on?

Jack and Rose lived happily ever after?

  • Let’s get the titanic data, using the Datasets widget in Orange.

  • There were 2201 people on board (passengers and crew), as per this dataset.

  • And let’s use a pre-set Workflow in Orange

Mosaic Chart Workflow

Data Dictionary: titanic

Quantitative Data

None.

Qualitative Data

  • survived: (chr) yes or no
  • status: (chr) Class of Travel, else “crew”
  • age: (chr) Adult, Child
  • sex: (chr) Male / Female.

What kind of Data Variables will we choose?

Variable #1 | Variable #2 | Chart Names         | Chart Shape
Qual        | Qual        | Pies, Mosaic Charts |

Here, area ∝ count: the area of each tile is proportional to the count of observations in that tile.

Research Question #1

What is the dependence of survived upon sex?

Note

  • Note the huge imbalance in survived across sex.
  • Men clearly perished in larger numbers, and in a larger proportion, than women.
  • Colouring shows large positive residuals for men who died, and large negative residuals for women who died.

So sadly Jack is far more likely to have died than Rose.

Research Question #2

How does survived depend upon status?

Note

  • Crew has seen deaths in large numbers,
    • as seen by the large negative residual for crew-survivals.
  • First Class passengers had speedy access to the boats and survived in larger proportions than, say, second or third class.
  • There is a large positive residual for first-class survivals.
  • Rose travelled first class and Jack was third class. So again the odds are stacked against him.
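A rough check of these statements in Python. The counts below are the ones commonly quoted for the 2201-person Titanic dataset; they are an assumption here and should be verified against the table loaded in Orange.

```python
# (survived, total) per status; assumed counts, verify against the Orange table.
by_status = {
    "first": (203, 325),
    "second": (118, 285),
    "third": (178, 706),
    "crew": (212, 885),
}

# Proportion surviving within each status group.
survival_rate = {
    status: survived / total for status, (survived, total) in by_status.items()
}

for status, rate in survival_rate.items():
    print(f"{status:<7} {rate:.1%}")

people_on_board = sum(total for _survived, total in by_status.values())
print("people on board:", people_on_board)
```

Comparing proportions rather than raw counts matters here because the groups are very different in size.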

What are these Residuals anyhow?

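Expected counts are the counts we would see in each tile if the two Qual variables were independent of each other. When the differences between the actual and expected counts are large, we deduce that one Qual variable has an effect on the other Qual variable (in terms of counts or proportions).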

Actual Counts

Expected Counts!!

Tile-Wise Differences = Residuals
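The three steps (actual counts, expected counts, tile-wise residuals) can be sketched as follows for survived versus sex. Two assumptions: the counts are the ones commonly quoted for the 2201-person Titanic dataset, and the residual is the Pearson residual, (actual − expected) / √expected, which may not be exactly what the mosaic colouring in Orange uses.

```python
from math import sqrt

# Actual counts per (sex, survived) tile; assumed values, verify in Orange.
observed = {
    ("male", "no"): 1364, ("male", "yes"): 367,
    ("female", "no"): 126, ("female", "yes"): 344,
}
total = sum(observed.values())

# Row and column totals.
sex_totals, survived_totals = {}, {}
for (sex, survived), n in observed.items():
    sex_totals[sex] = sex_totals.get(sex, 0) + n
    survived_totals[survived] = survived_totals.get(survived, 0) + n

# Expected counts if sex and survived were independent.
expected = {
    (sex, survived): sex_totals[sex] * survived_totals[survived] / total
    for (sex, survived) in observed
}

# Tile-wise Pearson residuals.
residuals = {
    tile: (observed[tile] - expected[tile]) / sqrt(expected[tile])
    for tile in observed
}

for tile, r in residuals.items():
    print(tile, f"actual={observed[tile]}", f"expected={expected[tile]:.1f}",
          f"residual={r:+.2f}")
```

A positive residual means more observations in the tile than independence would predict, a negative one means fewer.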

The Art of Surprise with Gabbar Singh
